Appendix C — Assignment C

Instructions

  1. You may talk to a friend to discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

  2. Do not write your name on the assignment.

  3. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  4. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  5. The assignment is worth 100 points, and is due on Thursday, 9th February 2023 at 11:59 pm.

  6. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (1 pt). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering in the comments section of Canvas, and submit the .ipynb file instead.
  • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission (1 pt)
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

C.1 Model assumptions

C.1.1

Using house_feature_train.csv and house_price_train.csv, fit a multiple linear regression model (without transformations) to predict house_price based on distance_MRT, latitude, longitude, house_age, and number_convenience_stores. Print the model summary. What is the model $R^2$?

(1 + 1 points)

C.1.2

Obtain the residuals and plot them separately against the fitted values and against each of the five feature variables. Make one figure containing the six subplots.

(4 points)
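One way to lay out such a figure is a 2×3 grid of subplots. The sketch below uses synthetic residuals and predictors in place of the fitted model's output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Synthetic stand-ins for the fitted values, residuals, and five features;
# in the assignment these would come from the fitted model and the data.
rng = np.random.default_rng(1)
n = 100
features = {name: rng.normal(size=n) for name in
            ["distance_MRT", "latitude", "longitude",
             "house_age", "number_convenience_stores"]}
fitted = rng.normal(size=n)
resid = rng.normal(size=n)

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
panels = [("Fitted values", fitted)] + list(features.items())
for ax, (label, x) in zip(axes.ravel(), panels):
    ax.scatter(x, resid, s=10)
    ax.axhline(0, color="red", lw=1)  # reference line at zero residual
    ax.set_xlabel(label)
    ax.set_ylabel("Residuals")
fig.tight_layout()
```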

C.1.3

Comment on the plot of residuals against fitted values. Does the model violate the assumption of linearity? Does the model violate the constant variance assumption?

(2 + 2 points)

C.1.4

Comment on the plot of residuals against the predictor variables. On the basis of these plots, should any further modifications of the regression model be attempted?

(5 points)

C.1.5

Calculate the RMSE using the test datasets for the model constructed in the first question. The test datasets are house_feature_test.csv and house_price_test.csv.

(2 points)
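The RMSE on test data is the square root of the mean squared prediction error. A minimal sketch, with hypothetical arrays standing in for the test-set prices and the model's predictions:

```python
import numpy as np

# Hypothetical values standing in for house_price_test.csv and the
# predictions produced on house_feature_test.csv.
y_test = np.array([300.0, 450.0, 520.0])
y_pred = np.array([310.0, 440.0, 500.0])

rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
```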

C.1.6

Using appropriate transformation(s) and/or variable interaction(s), update the model so that it achieves an $R^2$ of at least 80% and an RMSE (root mean squared error) of at most $350k on test data.

Print the model summary and report the $R^2$ and the RMSE on test data. Note:

  1. House prices are provided in thousands of dollars. A value of 556 in the house_price column indicates a house price of $556k.

  2. The test datasets are house_feature_test.csv and house_price_test.csv.

  3. $R^2$ is computed on training data, and RMSE is computed on test data.

  4. You must proceed logically, i.e., justify every transformation that you introduce into the model to improve it. If you are introducing interactions, there should be some rationale behind considering only certain interactions, unless you are considering all possible interactions.

(12 points for achieving the objectives + 8 points for justifications)

C.1.7

Are the assumptions of linearity and constant variance of errors satisfied in the model developed in the previous question? Make a scatterplot between the residuals and fitted values and use it to answer the question.

(4 points)

C.2 Multicollinearity and Outliers

The datasets Austin_Affordable_Housing_Train.csv and Austin_Affordable_Housing_Test.csv provide data on housing development projects that have received funding from the Affordable Housing Development Fund in Austin, Texas. The city provides property developers with tax credits and other forms of funding in exchange for agreements to set housing prices (e.g. rent) below market rate.

Each row represents a housing development in Austin. Variables include the amount (USD) provided by the city, the status of the housing project, the number of housing units, the period of affordability, and more.

Let’s say that you’re hired by the city as a consultant to work with subject matter experts in their Housing and Planning Department.

General Hint: For written sections, writing “it depends” (along with an explanation) often characterizes a good answer.

Note for Grading Team: Written answers should be given full credit as long as they’re thoughtful answers that address the question fully, base findings on relevant data/results, and align with the relevant regression theory/thinking. Many questions don’t have a single right answer and/or depend on context that isn’t provided here.

C.2.1

Suppose you run the code status_vars = pd.get_dummies(housing_dataframe["Status"]), append the columns of status_vars to your original data frame, and use the columns as predictors in a linear regression model. What potential problem would you likely be introducing into the model? How could it affect your results?

(4 points)
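A small illustration of what happens with a full set of dummy columns, using a toy Status column (the real values in housing_dataframe may differ):

```python
import pandas as pd

# Toy stand-in for housing_dataframe["Status"]; the actual categories
# in the dataset may differ.
status = pd.Series(["Completed", "In Progress", "Completed", "Planned"],
                   name="Status")

full = pd.get_dummies(status)                   # one column per level
safe = pd.get_dummies(status, drop_first=True)  # drops one reference level

# With an intercept in the model, the full set of dummies is perfectly
# collinear: each row of `full` sums to exactly 1, duplicating the
# constant column.
print((full.sum(axis=1) == 1).all())
```

The drop_first=True variant is one common way to avoid this, since the dropped level is absorbed into the intercept.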

C.2.2

Suppose that a subject matter expert recommends using the variables Total_Units, Total_Affordable_Units, Total_Accessible_Units, and Market_Rate_Units as predictors in your model. From a regression modeling standpoint, does this sound advisable? Produce metrics to quantify the potential impact of including the four predictors in a model. Interpret at least one of the metrics you provide, both statistically and in the context of the problem.

(4 points)

C.2.3

Say that the subject matter expert agrees to use Total_Affordable_Units, Affordability_Expiration_Year, and Units_Under_50_Percent_MFI as predictors for City_Amount. Fit the appropriate model (without transformations). Then interpret the results associated with Total_Affordable_Units, and comment on the overall model fit.

(4 points)

C.2.4

Using visualizations, investigate whether the model you fit in the previous question yields outlying observations. What count and proportion of observations would you classify as outliers?

Note: Show separate plots for both the residuals and the studentized residuals. However, use the studentized residuals when identifying outliers.

(4 points)

C.2.5

Based on your results in the previous question, would you choose to remove outlying observations? Why or why not?

(4 points)

C.2.6

Consider a scenario in which the model will be used by property owners seeking to predict the amount of money they may receive from the city of Austin. How would this change, support, or complicate your answer in the previous question, if at all?

(3 points)

C.2.7

Say that the model will be used by a team of sociologists seeking statistical evidence, at the $\alpha = 0.01$ significance level, that a property's affordability expiration year has an effect on the amount of money issued by the city of Austin. How would this change, support, or complicate your answer in C.2.5, if at all?

(3 points)

C.2.8

Determine whether the model you fit in C.2.3 contains any high-leverage points. Produce a visualization, then report the count and proportions of observations that are high-leverage (define an observation as “high-leverage” if its leverage is greater than four times the average leverage of all observations).

(4 points)

C.2.9

Based on your results in the previous question, would you choose to remove high-leverage observations? Why or why not?

(3 points)

C.2.10

Identify and remove any influential points from the training data and refit the model. How does removing the influential points affect the model, if at all?

Consider comparing the model summaries, and use the test data provided.

(6 points)

C.3 Autocorrelation

Refer to the autocorrelation example in the class notes. Predict the power consumption for each hour of each day of the year 2020. To predict the power consumption for a particular hour of a day, use all the data available up to the previous day; do not use any data from the day for which you are making the predictions. For example, to make the 24 hourly predictions for 4th April 2020, use all the data up to 3rd April 2020. Make the predictions using four different models:

  1. Model with only temp_hot and temp_cold as the predictors

  2. Model including one day lag of power as a predictor in addition to the predictors in model (1)

  3. Model including one week lag of power as a predictor in addition to the predictors in model (2)

  4. Model including two weeks lag of power as a predictor in addition to the predictors in model (3)

For each model:

  1. Report the RMSE for the predicted power in 2020. You should have 366 x 24 = 8784 predicted values of power for each model.

  2. Make a scatterplot of predicted power vs actual power (use color = ‘orange’). Plot the line x = y over the scatterplot (use color = ‘blue’).

Which model makes the most accurate predictions?

(4 points for developing the models + 4 points for computing the predictions + 4 points for computing the RMSEs + 2 points for the visualizations + 1 point for identifying the most accurate model)
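The lagged predictors for models (2)-(4) can be built by shifting the hourly power series by the corresponding number of hours. A minimal sketch on a toy hourly series (the real series would come from the class-notes dataset):

```python
import pandas as pd

# Toy hourly power series; the real data would come from the class notes.
idx = pd.date_range("2019-01-01", periods=24 * 30, freq="h")
power = pd.Series(range(len(idx)), index=idx, name="power", dtype=float)

df = power.to_frame()
df["power_lag_1d"] = df["power"].shift(24)       # one-day lag (24 hours)
df["power_lag_1w"] = df["power"].shift(24 * 7)   # one-week lag
df["power_lag_2w"] = df["power"].shift(24 * 14)  # two-week lag
```

Rows at the start of the series have missing lag values and would need to be dropped (or the training window begun later) before fitting.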